Policy Approximation and its Advantages

In policy gradient methods, the policy \(\pi(a|s,\boldsymbol{\theta})\) can be parameterized in any way, as long as …

  1. it is differentiable with respect to its parameters, i.e., the gradient \(\nabla \pi(a|s,\boldsymbol{\theta})\) exists
  2. that gradient is finite for all \(s\in S\), \(a \in A(s)\), and \(\boldsymbol{\theta} \in \mathbb R^{d'}\)

In practice, to ensure exploration we generally require that the policy never becomes deterministic, i.e., that \(\pi(a|s,\boldsymbol{\theta}) \in (0,1)\) for all \(s\), \(a\), and \(\boldsymbol{\theta}\).

Policy-based methods also offer useful ways of dealing with continuous action spaces, as we describe later in Section 13.7.

If the action space is discrete and not too large, then a natural and common kind of parameterization is to form parameterized numerical preferences \(h(s,a,\boldsymbol{\theta}) \in \mathbb{R}\) for each state-action pair. In other words, when the action space is discrete and not too large, we parameterize the preference for each action.

The actions with the highest preferences in each state are given the highest probabilities of being selected, for example according to an exponential soft-max distribution:

\[\pi(a|s,\boldsymbol{\theta}) \doteq \frac{e^{h(s,a,\boldsymbol{\theta})}}{\sum_{b \in A(s)} e^{h(s,b,\boldsymbol{\theta})}}\]

We call this kind of policy parameterization soft-max in action preferences.
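As a rough illustration of the soft-max in action preferences, the sketch below maps a list of preferences for one state to action probabilities. The function name `softmax_policy` and the example preference values are made up for this illustration. Subtracting the maximum preference before exponentiating is a common numerical-stability trick that leaves the probabilities unchanged.

```python
import math

def softmax_policy(preferences):
    """Map preferences h(s, a, theta) for one state to probabilities pi(a | s, theta)."""
    # Shifting every preference by the same constant does not change the
    # soft-max output, but it keeps exp() from overflowing for large values.
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical preferences for three actions in a single state.
probs = softmax_policy([2.0, 1.0, 0.0])
print(probs)
```

With finite preferences, every probability is strictly between 0 and 1, so the policy stays stochastic, and the action with the largest preference receives the largest probability.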

The action preferences themselves can be parameterized arbitrarily. For example, they might be computed by a deep artificial neural network (ANN), where \(\boldsymbol{\theta}\) is the vector of all the connection weights of the network. Or the preferences could simply be linear in features, using feature vectors \(\mathbf{x}(s,a) \in \mathbb{R}^{d'}\) constructed by any feature-construction method:

\[h(s,a,\boldsymbol{\theta}) = \boldsymbol{\theta}^\top\mathbf{x}(s,a)\]
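The linear case can be sketched in a few lines of plain Python. This is only an illustration under assumed values: the helper names (`linear_preferences`, `softmax`), the parameter vector `theta`, and the feature vectors in `features_s` are all hypothetical, with \(d' = 2\) features per state-action pair and three actions in a single state.

```python
import math
import random

def linear_preferences(theta, features):
    """Compute h(s, a, theta) = theta^T x(s, a) for each action's feature vector."""
    return [sum(t * x for t, x in zip(theta, x_sa)) for x_sa in features]

def softmax(preferences):
    """Soft-max over preferences, shifted by the max for numerical stability."""
    m = max(preferences)
    exps = [math.exp(h - m) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters and features (assumed values, not from the text).
theta = [0.5, -1.0]
features_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # x(s, a) for a = 0, 1, 2

h = linear_preferences(theta, features_s)  # one preference per action
pi = softmax(h)                            # pi(a | s, theta) for each action

# Sample an action according to the policy.
action = random.choices(range(len(pi)), weights=pi, k=1)[0]
print(h, pi, action)
```

Because the preferences are linear in \(\boldsymbol{\theta}\) and the soft-max is smooth, this policy is differentiable with respect to its parameters, which is the requirement stated at the start of the section.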
